Memory addressing techniques

ABSTRACT

A method of generating a stream of non-contiguous memory addresses representing contiguous points in logical space is described. The method comprises: generating initializing parameters describing the contiguous points in the logical space; configuring a memory address engine with the initializing parameters; performing an algorithm in the memory address engine according to the initializing parameters to produce a plurality of non-contiguous memory addresses; and collating the non-contiguous memory addresses into the stream of memory addresses for output to a data memory. The present invention has particular application to SIMD processing techniques where there are a plurality of memory address engines.

The present invention concerns improvements relating to memory addressing techniques and in particular, but not exclusively, this invention relates to generating memory addresses for data held in a memory store for use with microprocessors.

Modern microprocessor architecture is based upon the concept of a central processing unit (CPU) processing data sent to it from a memory store. The processed data can then be sent elsewhere for further processing, returned to storage or to a peripheral for display. The unprocessed data is kept in the memory store and is made ready for sending to the CPU when the memory store is paged by a memory manager. The memory manager is commonly referred to as a direct memory access (DMA) engine. The DMA engine is the microprocessor component responsible for handling memory access requested by the CPU, and it receives from the CPU instructions as to what data the CPU requires next, as well as the addresses in the memory store at which that data is located. Paged data requested from the memory store is sent along a databus to the CPU for processing.

It is an important consideration in microprocessor design that the CPU spends as many clock cycles as possible processing data, not waiting for data to arrive from the memory store, nor spending clock cycles working out what data it needs next. With the advent of multi-threading processors and vector processors, which are able to process data in parallel and carry out multiple instructions at once, the data demand rate of modern CPUs has meant that new techniques for keeping the CPU supplied with data have had to be invented. One technique is that of a memory hierarchy, which takes advantage of temporal and spatial locality.

The principle of memory hierarchy relies upon the facts that accessed memory data will be accessed again quickly (temporal locality), and that memory data adjacent to recently-accessed memory data will be accessed soon (spatial locality). Memory hierarchy is implemented by having a series of memory caches, starting with the largest main memory store (usually a hard disk drive of many gigabytes), decreasing in size and increasing in response speed, culminating in on-die memory caches of a few kilobytes running at the same speed as the CPU core.

Even so, having eased the problem of serving data to the CPU, there is still the question of how to speed up the process of generating the memory address calls in the first place. This task has usually been handled by the DMA controller referred to above. However, where a complex series of addresses is required, a DMA engine is generally unsuited to the role and the CPU itself becomes responsible. With a CPU design that has only one execution pipeline, the whole CPU is taken up with processing memory addresses until as many clock cycles have passed as are necessary to generate the next batch of memory addresses. Modern superscalar CPU designs have more than one execution unit, and these execution units can be programmed to carry out the task of the memory manager in processing and generating memory addresses.

However, as execution units must be able to take on any type of processing required by the CPU, execution units are normally full floating point units able to carry out complex calculations and may comprise an arithmetic and logic unit (ALU), registers and other electronic components of their own. Dedicating such a powerful part of the CPU simply to memory address generation is generally considered a waste of CPU processing ability; it also requires additional power and generates additional heat. These are increasingly important considerations, especially in mobile computing applications. As it is expensive to engineer multiple execution units on the same die, it is currently normal for CPUs to have no more than two or three full floating point execution units. Thus, allocating even one execution unit to memory address generation represents a significant decrease in processing power for the CPU.

Modern applications, particularly in the field of image and sound processing, have placed a further burden on the memory address generating process due to the high volumes of data involved, which may need moving each clock cycle.

Of interest to this invention is the field of medical imaging, and in particular, imaging a section, taken at an arbitrary angle, of a sample volume. In the medical imaging field, the accurate and automated delineation of anatomic structures from image data sequences requires heavy processing power when manipulation of a scanned volume is required. For example, image manipulation may be required after a scan is made using Magnetic Resonance Imaging (MRI), volumetric ultrasound or Positron Emission Tomography (PET) techniques. These techniques are non-invasive volumetric sampling techniques; brains, hearts and microscopic tissue sections are examples of the types of body volumes sampled by these methods.

The image of a scanned body volume is usually composed of many two-dimensional (2-D) slices, or planes, of pixels taken at regular intervals, each producing a 2-D data set. The space between any two pixels in one slice is referred to as the interpixel distance, which represents a real-world distance. The distance between any two slices is referred to as the interslice distance, which represents a real-world depth. Successive slices are stacked in computer memory based upon interpixel and interslice distances to accurately reflect the real-world sampled volume. Additional slices can be inserted between the dataset's actual slices, by various forms of interpolation, so that the entire volume is represented as one solid block of data at the appropriate resolution. Once the block of data has been established, the pixels in each slice take on volume and are referred to as volume pixels, or voxels. In other words, a voxel is the smallest distinguishable box-shaped part of a three-dimensional (3-D) image. This 3-D data set is then a digital representation of the actual sample volume.

Image manipulation usually involves presenting the image of a slice of the sample volume taken at a user-defined angle through the body volume. Calculations need to be carried out to determine which image data points (the required data points) will show in the new viewing plane, and further calculations need to be carried out to determine at which memory addresses in the memory store the data representing the required data points resides.

Each required data point may actually comprise several bytes of data representing, for example, chroma, luminance and opacity parameters; each of these parameters can in turn comprise several bits representing a range of values. When an image slice of the sample volume is taken, the data corresponding to each data point on that single slice is stored at sequential memory addresses in the memory store. It is usual to store image slice data by means of a progressive scan, whereby data corresponding to one corner of the 2-D image slice is recorded first and recording ends at the diagonally opposite corner, scanning progressively along each row until the entire image slice is captured. In order to obtain the memory address of the data corresponding to a single required data point, two steps need to be undertaken: first working out in three-dimensional physical space what co-ordinate the new data point will have, and then translating this into an actual memory address in memory space.
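
As a purely illustrative sketch of the second step (the function and parameter names are hypothetical, not taken from this document), the progressive-scan layout just described maps a slice co-ordinate to a memory address by counting the complete rows already stored plus the offset along the current row:

    def slice_address(x, y, row_length, bytes_per_point, base=0):
        # Progressive scan: points along a row are stored contiguously and
        # rows follow one another, starting from the corner point (0, 0).
        return base + bytes_per_point * (x + row_length * y)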

In order to present this new view with known microprocessor architecture, the CPU has to carry out the above two steps. As the view taken could be at any angle to the sample volume, it is unlikely that the data corresponding to the new view is stored in contiguous memory addresses in the memory store, as only views which represent a plane parallel to the plane of a scan slice will be stored contiguously. Due to the simple nature of known DMA engines, the CPU either has to provide the DMA engine with every single memory address required or directly fetch each datum itself. This task represents a significant reduction in the processing power of the CPU while it is tasked to carry out this calculation.

Other applications which require a similar level of memory address processing include Z-buffering, ray tracing, occlusion matching and sound processing.

It is with a view to addressing the above problems, especially the problem of off-loading intensive memory address generation processing from the CPU, that the following invention is presented.

According to one aspect of the present invention there is provided a method of generating a stream of non-contiguous memory addresses representing contiguous points in logical space, the method comprising: generating initialising parameters describing the contiguous points in the logical space; configuring a memory address engine with the initialising parameters; performing an algorithm in the memory address engine according to the initialising parameters to produce a plurality of non-contiguous memory addresses; and collating the non-contiguous memory addresses into the stream of memory addresses for output to a data memory.

By off-loading to an external memory address engine the job of calculating required memory addresses, the controller (which is commonly, but not always, the CPU) is freed to concentrate on processing data returned from the memory store. Furthermore, the controller only has to set up the memory address engine once for a series of memory addresses to be computed, which means the controller is able to spend a higher percentage of its time conducting processing tasks. More specifically, the data processing required by the controller is limited to generating the initialising parameters, which typically comprise a few bytes as compared to the massive number of non-contiguous memory addresses which are generated by the memory address engine as a result.

Another advantage of this invention is that historically memory address engines, such as DMA engines, have not been able to produce a continuous stream of memory addresses made up of non-contiguous memory addresses without intervention from the controller or CPU. This invention, due to the algorithmic nature of processing the memory addresses, is able to produce this beneficial effect, thereby reducing controller input compared with known memory address engines.

Advantageously, the generation of the initialising parameters enables viewing of a logical plane of data taken at an angle through a volume of data, and it is also advantageous when generation of the series of parameters enables viewing of a portion of the logical plane of data. This is particularly true for viewing 2-D images of a 3-D volume.

Preferably the method comprises an algorithm which incorporates the steps of calculating memory addresses progressively for each data point on a row in a logical plane and repeating the above calculation for the number of rows in the logical plane. This regularity in processing elements of the logical space enables the task of parameterising the contiguous points of the logical space for which memory addresses are required to be simplified. Namely, the more regular the distribution of the processing elements describing the logical space, the more the address processing engine can carry out with minimal instructions from the controller.

Rather than computing memory addresses plane by plane in real-world space and only keeping the required memory addresses relating to data points in the logical plane, it is far more efficient to calculate memory addresses just for the data points in the logical plane.

Another advantageous feature of the claimed method is achieved if the configuring step incorporates specifying the co-ordinates of the initial data point in a logical plane, specifying a vector for a unit increase in a column of the logical plane, specifying a vector for a unit increase in a row of the logical plane, and specifying column limits and row limits for the required data in the logical plane.

The above steps outline the basic recursive computational procedure for solving the problem of how to calculate only the memory addresses relating to the required data points in the logical plane. These values comprise the core set of initialising parameters for the memory address engine and represent a comparatively insignificant computational requirement for the controller as compared to the memory address calculation which they enable the memory address engine to carry out.

For the advantageous feature mentioned above, the step of specifying the co-ordinates of the initial data point on the logical plane may be carried out using Cartesian co-ordinates. Similarly, the steps of specifying the column vector and the row vector are preferably carried out using Cartesian co-ordinates. The logical plane is preferably representative of a plane through a real-world sample, and so each point in the logical plane can be represented by an appropriate co-ordinate. Also, this means that as each of the co-ordinate axes is independent, the processing task can be carried out concurrently for each axis by the algorithm in the memory address engine.

As the logical plane could be inclined at any angle to the real-world Cartesian co-ordinate space, it follows that the vectors needed to traverse the rows and columns in the logical plane may have a component in each of these three axes.

It may be desirable for the method to further comprise the step of outputting the collated stream of memory addresses to a memory store, typically a Secondary Data Store. The Secondary Data Store can be any memory, for example main system memory.

In an alternative aspect of the claimed method, performing the algorithm preferably further comprises the step of checking whether a generated memory address is accessible to the memory address engine.

It is usual for memory address engines to have limits imposed on where in the memory space they can access. By access is meant reading from, or writing to, a memory address. This can be due to efficiency reasons, e.g. dividing up the memory space between several memory address engines so that each one can work on a different range of memory addresses in parallel.

In the above alternative aspect, performing the algorithm preferably comprises comparing a generated memory address to a predefined range of memory addresses accessible to the memory address engine. This simple check is all that is needed to prevent a waste of processing time in trying to retrieve data from a memory location which it is not possible for the memory address engine to access. In the event that such an address is detected, the performing step may further comprise returning a null result. This helps processing as the null result can be displayed, for example, as a frame around image data.

The present invention is more efficient when, preferably, there are a plurality of memory address engines. In this case, the configuring step may further comprise configuring at least one additional memory address engine with the generated initialising parameters, and the performing step may be carried out in each of the at least one additional memory address engines to produce the plurality of non-contiguous memory addresses.

Splitting memory address generating requirements amongst several memory address engines to process in parallel is more efficient than only using one memory address engine. In the case of memory address engines working in single instruction multiple data (SIMD) mode, each memory address engine will be set with the same initialising parameters.

When using several memory address engines, it is preferable that performing the algorithm in each memory address engine incorporates determining whether a generated memory address is accessible to a neighbouring memory address engine.

In the case where it is accessible, performing the algorithm in each memory address engine desirably further comprises generating a memory address on behalf of a neighbouring memory address engine.

These two features enable the address calculation task to be distributed over an array of parallel memory address engines which each have access to their respective memory stores. The slight additional addressing complexity is more than compensated for by the significant increase in speed.

For the abovementioned feature, the method preferably further comprises routing data returned from a memory store associated with a respective memory address engine as though it had been returned from a respective memory store of a neighbouring memory address engine when generating a memory address on behalf of a neighbouring memory address engine. In this way the destination of the data, which may be obtained from a neighbouring data store, can be correctly directed to the processor or store associated with the memory address engine from where the data was expected. This minimises changes in any data processing algorithm that operates on that data.

The routing step may further comprise excluding memory address data from being routed to the neighbouring memory address engine. This keeps the separation between data and addresses, which is important for ease of operation, and also minimises the number of input/output pins which are required for a silicon implementation of the routing function.

When generating a memory address on behalf of a neighbouring memory address engine, it is preferable that this further comprises synchronising the data transfer with the neighbouring memory address engine. Whilst each memory address engine can work independently to a degree, when routing data to a neighbour this is a universal change in the data flow for which it is highly advantageous that there is synchronisation between the engines.

In an alternative aspect of the present invention, there is provided a memory address engine arranged to accept initialising parameters describing contiguous points in logical space set by an external controller, the memory address engine comprising: an address generator arranged to generate a plurality of non-contiguous memory addresses according to at least one algorithm implemented on the initialising parameters; and collation means arranged to collate non-contiguous memory addresses into a stream of output memory addresses for output to a data memory.

According to another aspect of the present invention there is provided a memory address processing system comprising: at least a first and a second memory address engine; a first and a second primary data store associated with the respective memory address engines; a first and a second secondary data store associated with the respective memory address engines; a databus connecting each memory address engine with its associated primary and secondary data stores; and a data router associated with each memory address engine, the data router associated with the first memory address engine being arranged to route data from the first secondary data store of the first memory address engine to the second primary data store of the second memory address engine upon instruction.

The above memory address processing system exploits the advantages of having a plurality of memory address engines to calculate the required addresses quickly. The address calculation is carried out on small subsets of the logical space and as such is considerably faster than if carried out on one large space. Rather than change the processing algorithms to deal with the new distributed structure, the provision of a router between two memory address engines makes the processing far easier.

In the above system, it is preferable that the instruction for the data router associated with the first memory address engine is sent from the first memory address engine. This keeps the control local and enables the memory address engine to specify when the data router is to route data to the controller directly from its own local data store and when it is to route data from its neighbouring memory address engine's local data store. The latter is usually required when the first memory address engine calculates that the memory address is not in its local data store.

According to another aspect of the present invention there is provided a router for use with a memory address processing system comprising a plurality of memory address engines, the router being arranged, upon instruction from a first memory address engine, to direct data from a memory store associated with the first memory address engine to a memory store associated with a second memory address engine.

Such a router has particular application when subdividing the task of memory address calculation such that many different processors and memory address engines are operating together. The router is not complicated because it can be arranged to route data in one of two configurations: either directly from the data store of the associated memory address engine or from the data store associated with the neighbouring memory address engine. Accordingly, each router can be fabricated in silicon relatively cheaply in terms of space and cost, and provides a significant reduction in the computational overhead of the CPU having to work out which returned data is valid and represents which request.

Preferably, only memory data from the first memory store associated with the first memory address engine is directed to a memory store associated with a second memory address engine. This allows the router to be simply constructed and repeated for each memory address engine.

In order that the invention may be more readily understood, the invention will now be described, by way of example, with reference to the accompanying drawings in which:

FIG. 1 is a schematic block diagram showing an example of a prior art DMA;

FIG. 2 is a schematic block diagram showing a processing system according to a first embodiment of the present invention;

FIG. 3 is a representation of an arbitrary rectangular patch of voxel data orthogonal to the z-axis illustrating the different initialising parameters used in the processing system of FIG. 2;

FIG. 4 shows an example of a possible implementation of the present embodiment for determining MPEG programming calculations;

FIG. 5 is a schematic block diagram showing a second embodiment of the present invention which comprises a Secondary Data Movement Controller (SDMC) with primary data store and router;

FIGS. 6a to 6d are schematic diagrams showing possible configurations of the router shown in FIG. 5;

FIG. 7 is a table showing SDMC control and status registers of the second embodiment;

FIG. 8 shows the detailed architecture of the SDMC of FIG. 5;

FIG. 9 is a flow diagram of a data path utilised in the architecture shown in FIG. 8; and

FIG. 10 is a flow diagram of the process used in an alternative embodiment of the invention comprising a SDMC with an associated processor.

In FIG. 1 there is shown a simple prior art processing system comprising a CPU 2, a DMA engine 8 and a memory store 4. The CPU 2 is connected to the memory store 4 by a databus 6. The CPU transmits a memory address request to the DMA engine 8, which pages the memory store 4 to send data residing at that memory address to the CPU 2 for processing. The data sent from the memory store 4 may be collected in a memory cache (not shown) before the CPU is ready to process the data sent from the memory store.

FIG. 2 shows a processing system according to a first embodiment of the current invention. For ease of understanding, the differences between the first embodiment and the prior art are used to explain the first embodiment. The memory store 4 in FIG. 1 is replaced by a Secondary Data Store (SDS) 14 which is used to store very large 2-D or 3-D data sets. The SDS 14 is connected to a data transfer pipeline 16, which in turn is connected to a Primary Data Store (PDS) 12. The PDS 12 can be thought of as being a cache local to the CPU 2 or some other processor, such as an associated string processor (which is the subject of the Applicant's co-pending International patent application, published as WO 02/42906). The required 2-D or 3-D data sets are transferred into the PDS prior to processing by the processor associated with the PDS. The address generation in this network is handled by a smart DMA engine called a Secondary Data Movement Controller (SDMC) 18, which sends data address requests via the pipeline 16 to the SDS 14. The SDMC 18 is set up with address instructions from the CPU 2. The CPU which sets up the SDMC 18 does not necessarily have to be the same processor which is associated with the PDS.

In terms of the memory hierarchy described earlier, the SDS can be any memory, or memory cache, lower down in the hierarchy compared to the PDS. Of particular importance to this invention is the scenario wherein the PDS is closely associated with a processor local to the SDMC 18, and in particular, wherein the processor is a vector processor.

The data transfer pipeline 16 is connected to the SDS 14 by pins 15 and 17, allowing simultaneous transmission of addressing data (pins 15) and data (pins 17). The data transfer pipeline also transfers data to the PDS, and the data bandwidth determines the number of pins present in connectors 17 and 19.

The SDMC 18 employs a Cartesian co-ordinate system with conventional x, y and z axes. The size and arrangement of the 3-D (volumetric) data set is defined by xSize, ySize and zSize. As previously described, the units are in voxels (the 3-D equivalent of pixels), which have a resolution/precision appropriate to the application. For example, the datum representing a voxel might comprise information about colour, density, transparency etc. and would therefore occupy a plurality of bytes of storage (voxelSize). This would lead to the total 3-D dataset comprising the following number of bytes:

    voxelSize * xSize * ySize * zSize bytes

The parameters xSize, ySize and zSize are not directly used by the SDMC architecture. Instead, users specify how many bytes in memory to skip in order to move a single voxel along each of the x, y and z axes. These quantities are called xScale, yScale and zScale. Typically (although not necessarily) the values of xSize, ySize and zSize would be identical.

One can interpret any 3-D data set as zSize planes, each with dimensions xSize by ySize. The 3-D data is conventionally arranged in the SDS with the first row (y=0) of the first plane (z=0) physically first in memory. The first data element in this first row is, naturally enough, the one at x=0, followed by x=1, up to x=xSize−1. This is followed by the second row of the first plane, and so on up until the last voxel of the first plane (x=xSize−1, y=ySize−1). From the last voxel of the first plane, one moves, naturally enough, to the first voxel of the second plane (x=0, y=0, z=1), and so on through the data set.

All of this could be expressed more concisely by stating that x is the minor dimension and z is the major dimension. The relationship between xSize, ySize and zSize and xScale, yScale and zScale follows from this:

    xScale = voxelSize
    yScale = xScale * xSize
    zScale = yScale * ySize

The SDMC 18 facilitates the transfer of voxel data between the SDS 14 and the PDS 12. It internally calculates the co-ordinates (x, y, z) of each voxel to be transferred and multiplies these by the scale factors (xScale, yScale and zScale respectively) in order to determine the actual memory address of the voxel.
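
A minimal sketch of this calculation, assuming illustrative Python names and an invented 4-byte voxel in a 256 by 256 plane:

    voxel_size = 4                      # bytes per voxel (illustrative)
    x_size, y_size = 256, 256           # plane dimensions (illustrative)

    x_scale = voxel_size                # bytes skipped per voxel step in x
    y_scale = x_scale * x_size          # bytes skipped per voxel step in y
    z_scale = y_scale * y_size          # bytes skipped per voxel step in z

    def voxel_byte_address(x, y, z):
        # x is the minor dimension and z the major dimension
        return x * x_scale + y * y_scale + z * z_scale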

Many parallel processing applications require neighbouring voxels to be loaded into neighbouring processing elements so that an array of execution units inter-connected by a communication network can be used efficiently to share neighbour data during common processing tasks, such as convolutions and matrix multiplication. Accordingly, the SDMC 18 is designed to transfer 2-D “patches of voxels” to and from the SDS.

A “patch” is a rectangular arrangement of voxels all lying in the same arbitrary plane within the volumetric data set. This 2-D “patch of voxels” is not to be confused with a 2-D image, although under certain highly specific circumstances the two would be identical. The orientation of this plane can be arbitrary. In other words, the locations and order of the voxels stored in the SDS involved in the transfer are defined by an arbitrarily orientated rectangle conceptually positioned within the volume data. Most of the SDMC parameters are needed to specify the position, size and orientation of this patch of voxel data within the volume data.

The patch of voxel data has a column dimension and a row dimension. The column dimension is always the minor dimension, i.e. data are transferred row by row; therefore the column index changes the most frequently during a transfer.

In use, the CPU 2 issues instructions to set up the SDMC 18 with a series of addressing instructions during initialisation. By using an algorithm implemented in hardware, an example of which is described in detail later on, the SDMC 18 is able to generate a series of required memory addresses which are then paged to the SDS and returned to the PDS. Hence memory address calculation is offloaded from the CPU to the SDMC 18, and the CPU is able to concentrate clock cycles on processing data taken from the PDS.

FIG. 3 shows a graphical example where a patch 20 of data items 21 (only a portion of the data items 21 are labelled in FIG. 3) is chosen, for convenience of illustration only, to be aligned with the z-plane. In this example, the size of the patch is 3×3 data items.

The ColIncr is the vector required to calculate the address of the next data element along a row of the “data patch”. The RowIncr is the vector required to calculate the memory address of the next data element in the next row. ColIncr and RowIncr each have three components, one each for the x, y and z directions (xColIncr, yColIncr etc). In the diagram, zColIncr and zRowIncr are obviously not represented, since the patch of voxels lies on a z-plane in this illustration.

However, in cases wherein the chosen plane is not parallel to the z-plane, each vector ColIncr and RowIncr will have a component in each of the three axes.

The plane chosen is referred to as the logical plane, and the ColIncr and RowIncr vectors are vectors in the physical plane which, when taken together, represent a unit vector in the logical plane.

The CPU 2 works out the x, y and z components for both ColIncr and RowIncr, and these are sent to the SDMC 18 during initialisation and maintained throughout all the following calculations until all the data points for the required patch are calculated.

A more complete description of the controlling parameters is given in the following sections.

xInit, yInit, zInit

The position of the rectangular patch of voxel data 20, or matrix, is determined by the x, y, z co-ordinates of its first voxel (the voxel with the minimum column and row co-ordinate within the matrix). This position is specified by the SDMC parameters xInit, yInit and zInit. The values of xInit, yInit and zInit are 32-bit signed fixed point values, with the lower 16 bits representing the fractional part.

ColIncr and RowIncr

The orientation of the rectangular patch 20 is specified by two vectors, ColIncr(x,y,z) and RowIncr(x,y,z). The ColIncr is the vector required to move to the next data element along a row of the rectangular patch 20. The RowIncr is the vector required to move from a row to the next row. ColIncr and RowIncr each have three components, one each for the x, y and z directions (xColIncr, yColIncr etc). These components are 32-bit two's complement fixed point values, with the most significant 16 bits representing the integer part and the least significant 16 bits representing the fractional part.
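
A hedged sketch of this 16.16 two's complement representation (the helper names are invented):

    def to_fixed_16_16(value):
        # 32-bit word: most significant 16 bits integer part, least
        # significant 16 bits fractional part, two's complement.
        return int(round(value * 65536)) & 0xFFFFFFFF

    def from_fixed_16_16(word):
        if word & 0x80000000:           # sign-extend negative values
            word -= 1 << 32
        return word / 65536.0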

ColLimit and RowLimit

The number of source/destination data elements within the patch (and hence its size) is specified by ColLimitL, ColLimitH, RowLimitL and RowLimitH. These specify the begin and end limits of the internal SDMC row and column counters used to traverse the data elements defined by the rectangular patch 20, i.e. the column counter counts from ColLimitL to ColLimitH and the row counter from RowLimitL to RowLimitH. In this manner, a subset of the chosen patch may be transferred at the user's discretion.

The above three sets of parameters, namely the initial values of the starting voxel, the unit increment vectors in the logical plane and the logical plane limits, are all that are needed to specify the patch of data in the logical plane. Further optional parameters may be provided to increase the functionality of the SDMC.
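
Taken together, these three parameter sets admit a simple software model of the address stream. The following is a hedged sketch in plain Python (it ignores the hardware's fixed-point arithmetic, and all names are invented):

    def patch_addresses(init, col_incr, row_incr, col_limits, row_limits,
                        scales):
        # Yield one SDS byte address per data point of the patch, row by
        # row, so that the column index changes the most frequently.
        (cl, ch), (rl, rh) = col_limits, row_limits
        for row in range(rl, rh + 1):
            for col in range(cl, ch + 1):
                # Component-wise position in (possibly fractional) voxels
                pos = [init[a] + col * col_incr[a] + row * row_incr[a]
                       for a in range(3)]
                # Discard the fractional parts, then scale each axis to bytes
                yield sum(int(p) * s for p, s in zip(pos, scales))

For the 3×3 z-aligned patch of FIG. 3, for example, patch_addresses((xInit, yInit, zInit), (1, 0, 0), (0, 1, 0), (0, 2), (0, 2), (xScale, yScale, zScale)) would yield nine addresses.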

For example, the SDMC 18 does not have to transfer complete “patches” (or rectangles) of data; the actual number of data elements to be transferred is specified by TfrLength (transfer length). TfrLength may be less than the number of data elements defined by RowLimit and ColLimit, or more probably it will be the same as, or an integer multiple of, this value (enabling multiple copies of the rectangular patch's data to be transferred to the PDS during load operations).

Another example is providing the programmable parameters RowBound, ColBound and NullData. It is often the case that only a subset of a patch of data from the SDS is required to be extracted, whilst padding the remaining data items with some arbitrary NULL value. The SDMC 18 has additional functionality to help select only this subset of data from the SDS and fill the remaining locations with NULL data to be transferred to the PDS.

Similarly, the processing of a patch of data may result in invalid results being stored back into the local PDS (i.e. at the boundaries of a patch after processing). Rather than waste computational bandwidth in the CPU 2 to suppress the generation of these invalid results (not practical or desirable in a SIMD processor), it is better to write only the valid results back to the SDS after processing, discarding any invalid boundary data.

Data is only transferred between the PDS and SDS when the internal SDMC column count is >= ColBoundL and <= ColBoundH and the internal SDMC row count is >= RowBoundL and <= RowBoundH, i.e. RowBoundH, RowBoundL, ColBoundH and ColBoundL mark the bounds of the internal row and column counters within which valid data is transferred. When either the row or column counters exceed their respective bounds, the NULL data value is loaded into the PDS during a load operation. During a store cycle the PDS data is discarded and no data is written to the SDS. The NULL data value is specified by the NullData SDMC parameter.
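
An illustrative sketch of this test for a load cycle (the names are invented; read_sds stands in for an SDS access):

    def load_element(col, row, read_sds, null_data, col_bound, row_bound):
        # Value loaded into the PDS for one (col, row) counter position.
        # During a store cycle, the same test would instead decide whether
        # the PDS data is written to the SDS or simply discarded.
        (cbl, cbh), (rbl, rbh) = col_bound, row_bound
        if cbl <= col <= cbh and rbl <= row <= rbh:
            return read_sds(col, row)   # counters in bounds: valid SDS data
        return null_data                # out of bounds: NULL padding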

FIG. 4 shows an example in MPEG video decoding, where an IDCT (inverse discrete cosine transformation) must be carried out on an 8×8 matrix, the result of which (also an 8×8 matrix) must be added to reference data produced from a 9×9 matrix by half-pixel interpolation. One approach might employ 81 SIMD processing elements to handle the computation instead of 64, to allow enough room to load the 9×9 reference data over the IDCT results without needing to make any extra space first. In this situation, one would want to load the entire IDCT input data (64 data elements) in one transfer into 81 Associative Processing Elements. However, in order to facilitate the ensuing matrix multiplications, each row of 8 from the input data would be required to be left-aligned in a row of 9 in the SIMD processor array, with a final blank row of 9 at the “end”.

This can easily be achieved with the SDMC 18 by appropriate specification of the RowBound and ColBound parameters. For the MPEG example one could simply set ColLimitL=RowLimitL=0, ColLimitH=RowLimitH=8 and RowBoundL=ColBoundL=0, RowBoundH=ColBoundH=7, remembering to set TfrLength to 81.
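
Expressed as an illustrative parameter block (the dictionary form is an assumption for readability; the values follow the text):

    mpeg_sdmc_params = {
        "ColLimitL": 0, "ColLimitH": 8,   # column counter 0..8: rows of 9
        "RowLimitL": 0, "RowLimitH": 8,   # row counter 0..8: 9 rows
        "ColBoundL": 0, "ColBoundH": 7,   # valid data only in columns 0..7
        "RowBoundL": 0, "RowBoundH": 7,   # valid data only in rows 0..7
        "TfrLength": 81,                  # fill all 81 processing elements
    }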

With an array of SDMCs processing data in parallel, in a Single Instruction Multiple Data (SIMD) mode, a second embodiment of the invention, represented by the processing system shown in FIG. 5, can be employed. In this parallel configuration, a Secondary Data Transfer (SDT) router 20 is positioned between the pipeline 16 and the PDS 12. When the CPU sets up the SDMC 18 with initial instructions, the CPU also lets the SDMC 18 know which series of memory addresses each SDMC 18 can access.

If, during the course of a first SDMC 18 processing the queue of addressing instructions initialised by the CPU, the SDMC 18 finds that the memory address generated is outside its own allocated local series of memory addresses, the SDMC 18 determines which neighbouring SDMC is able to access this memory address and the router accordingly sets itself to route data from a neighbouring SDMC's router (22, 24) to the PDS of the first SDMC 18 via the first SDMC's router 20.

It is important to remember that the SDMC programming parameters enable a global voxel address to be calculated. The global data set may be distributed across multiple local SDSs, each with their own SDMC and secondary I-O channel to the PDS, thereby offering enhanced secondary I-O (input-output) performance. Many SIMD applications will work by simply partitioning the data into multiple independent data sets in this way.

However, benefit comes from the ability of neighbouring SDMC channels to have physical access to each other's local data whilst retaining a global addressing scheme to enable them to reference data throughout the system SDS. Invariably this assumes local data sets with equal sizes and shapes and data accesses restricted to neighbouring SDSs, a reasonable and cost-effective assumption in SIMD 2-D and 3-D imaging applications.

Therefore the “channel” architecture allows a particular SDMC 18 to effectively read data from the SDS of one of its nearest neighbours. The SDMC 18 in question does not directly fetch the data itself, but utilises the neighbour's SDMC to undertake the fetch. In other words, each SDMC issues a global address and fetches the data. However the SDMC, recognising that this data is actually in a neighbouring SDS, applies an appropriate offset to its own global address to create a local address and routes the resulting data transfer to its neighbour via a simple neighbour switching network.

In practice this relies on all SDMC channels requiring access to neighbouring SDS data at the same point in the program, so that they are all fetching data from their local SDS on behalf of a neighbour. This is undertaken by employing a handshaking channel between neighbouring SDMCs. In an array of SDMCs working in SIMD mode, each SDMC 18 may be at a slightly different place in the queue of addressing instructions it has been initialised to process. As each SDMC 18 is operating on the same set of instructions, albeit on a different portion of data, as soon as one SDMC 18 realises the address generated is outside its local data store it sends a handshaking signal to neighbouring SDMCs. As soon as every SDMC has reached the same position, and realised that the next address is outside its local SDS, the routers are activated, each SDMC processes the address page of the appropriate neighbour, and the router routes the returned data to that neighbour's PDS.

The handshaking process is critical to the above scenario as, in a SIMD processing architecture, if one SDMC requires data held in a neighbouring SDS then all the processing units will as well. This transfer cannot take place until all the processing units are synchronised at the same stage in the queue of processing instructions.

This embodiment not only shows the scalability of the present invention but also how, with the router, the size of the local data stores for each SDMC can be reduced. Without the router function, each SDMC would need to have an accessible local memory store of the same size as the entire SDS, which is costly.

Furthermore, the positioning of the router between the data transfer pipeline 16 and the PDS 12, and not between the pipeline 16 and the SDS 14, means that significantly fewer signals need to be provided. It is clear from FIG. 5 that the router only needs to provide enough pins to enable the data to be transferred to the PDS 12, whereas situating the router between the pipeline 16 and the SDS 14 would necessitate providing additional pins to transmit the addressing data as well. The placing of the router as shown in FIG. 5 considerably reduces the number of connections, i.e. the pin count, needed in the processing system. This equates to an appreciable reduction in circuit board real estate, cost and power consumption.

In practice, router activation as shown in FIG. 5 is controlled by setting the parameters WrapBase and WrapLimit. WrapBase and WrapLimit are used by the SDMC 18 to modify the global position vector, which may refer to a physical location in a neighbour's SDS, so that it always points to a corresponding position within the local SDS. The router enabling bits must be set in the SDMC function register for a transfer that requires this functionality; the routers must be enabled independently for accesses that exceed the bounds of the local SDS in each of the x, y or z directions.

WrapBase and WrapLimit also specify what subset of the global SDS a particular SDMC's SDS occupies, i.e. they define the local SDS particular to each SDMC 18. WrapBase (xWrapBase, yWrapBase and zWrapBase) specifies the global voxel co-ordinate corresponding to the first voxel in the local SDS. WrapLimit (xWrapLimit, yWrapLimit, zWrapLimit) specifies the number of voxel steps that can be taken along each axis before the limit of the local SDS is reached.

If the volume is divided across multiple local SDSs (each with their own SDMC 18 and secondary I-O channel to the PDS), the xInit, yInit and zInit values must be supplied relative to the xWrapBase, yWrapBase and zWrapBase values of a given data channel.

If the required memory address of a first SDMC 18 is larger than WrapLimit, then the WrapLimit is subtracted from the required address and the first SDMC 18 is set to return from the first SDMC's local SDS the data required by the neighbouring SDMC with a lower local range of memory addresses, this data being routed to that neighbouring SDMC's PDS through both the first and the neighbouring SDMC's routers. The data required by the first SDMC 18 will be returned to the first SDMC's PDS by the other neighbouring SDMC through the other neighbouring SDMC's router and the first SDMC's router.

Conversely, if the required memory address is less than zero, then the required address is added to the WrapLimit and the first SDMC is set to return from its local SDS data that will be routed to a neighbouring SDMC with a local SDS range higher than its own.

FIGS. 6a to 6d show the routing configurations for router 20 supported under this scheme. Note that in the example shown, the router 20 only operates during data load cycles (i.e. SDS to PDS transfers). This is because in this example the scheme anticipates that a given processor will try to access “overlap” data from neighbouring data sources, but will generally only use this to calculate local results (i.e. store to its own local SDS). This restriction need not always apply in the general case.

Other parameters that may be set by the algorithm (implemented in hardware) are upper and lower limits for hard bounds checking: HardBoundH and HardBoundL. The SDMC 18 provides bounds checking for the addresses that it calculates so that data can be prevented from being read from, or written to, invalid locations. This is particularly useful for applications that divide their data sets over multiple SDS channels and employ neighbour data accesses. In this situation, the channels containing the “ends” of the data set will inevitably generate addresses which lie outside of the global data set.

HardBoundH (xHardBoundH, yHardBoundH and zHardBoundH) and HardBoundL (xHardBoundL, yHardBoundL and zHardBoundL) specify the “hard” boundaries imposed on the global position vector calculated by the SDMC 18. When a hard bound is exceeded, no memory read or write accesses are generated and the NullData value is returned in place of read data.
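
A one-line sketch of the check (illustrative names; the co-ordinates are global, i.e. after any WrapBase offset has been applied):

    def within_hard_bounds(pos, hard_lo, hard_hi):
        # pos, hard_lo and hard_hi are (x, y, z) tuples; outside the bounds
        # no SDS access is generated and NullData replaces read data.
        return all(lo <= p <= hi for p, lo, hi in zip(pos, hard_lo, hard_hi))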

The SDMC 18 referred to in FIGS. 5 and 6 has the configuration registers shown in FIG. 7. The AddressBase referred to in FIG. 8 is the offset of the data set from the start of the local SDS. This represents a simple offset within the SDS store and has no special relationship with the 2-D or 3-D data set size or its co-ordinate system.

The detailed architecture of the SDMC 18 for use in the second embodiment, as referred to in FIGS. 5 to 7, is shown in FIG. 8.

Whilst this SDMC is described specifically with reference to the second embodiment, it can readily be adapted for use with the first embodiment.

There are three main units in the circuit shown in FIG. 8 which, in turn, are responsible for generating the x, y and z components of each of the logical plane vectors RowIncr and ColIncr. To recap from above, RowIncr and ColIncr relate to unit vectors in the logical plane and, as the logical plane can be at any angle to the real-world sample being viewed, RowIncr and ColIncr can have components in all three axes. Detailed circuitry is only shown for the unit responsible for calculating the x component in each vector calculation of RowIncr and ColIncr, although the other two units will comprise exactly the same components. In operation all three units run at the same time to generate a single resulting memory address comprising outputs from all three units.

A counting mechanism, referred to as a patch index logic circuit (not shown), provides the clocking signals driving the x, y and z component processing units, and this can be readily achieved by known methods. This counting mechanism indexes through the columns and rows of the logical plane to keep track of how many columns and rows have been processed and to stop processing when the column limits and row limits have been reached.

The following is described with reference only to the unit responsible for calculating the X component. Exactly the same process is followed in the other Y component and Z component units. Upon an appropriate signal from the counting mechanism, the xColIncr multiplexer starts adding xColIncr, which is one of the parameters initialised at the start of the operation.

The principal module for calculating the global address in {x, y, z} parameter space utilises 32-bit fixed-point arithmetic for the application of the RowIncr and ColIncr offsets. In order to implement the concept in the most efficient manner, the invention supplies the Init (x, y, z) values per channel relative to the local WrapBase (x, y, z). In this manner all address calculations are actually made for the local SDS memory, but may readily be recast into the global address space (i.e. for hard bounds checking) by simply adding the WrapBase offset to the local address.

For applying the wrap offsets, the integer portion of the address (which is relative to the WrapBase) is compared to the WrapLimit. If the address is greater than the WrapLimit or less than 0, then the router will be enabled to redirect data to either the right or left neighbouring PDS respectively. Under these conditions, since the address is not directed via the router but remains strictly local, it must be modified to provide the local address for the neighbour-routed data. If the address is greater than the WrapLimit, then WrapLimit is subtracted to calculate the local address. If the address is negative, then it is summed with WrapLimit, again to calculate the local address.
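
A per-axis sketch of this wrap test (all names illustrative; the “right”/“left” strings follow the right/left neighbour convention of the text):

    def apply_wrap(addr, wrap_limit):
        # Map one WrapBase-relative axis address to a local address plus a
        # routing decision: "right", "left", or None for strictly local.
        if addr > wrap_limit:
            return addr - wrap_limit, "right"  # redirect to right neighbour
        if addr < 0:
            return addr + wrap_limit, "left"   # redirect to left neighbour
        return addr, None                      # local: router passes through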

Alternatively, if no router network is needed, then the global address is passed unmodified to the Scale multiplier.

Next, the hard bounds check is performed. The address is first summed with WrapBase to calculate the global address, then compared to the HardBounds limits.

Subsequently, and in order to allow for non-unity voxel sizes, the address is then multiplied by the Scale factor.

Finally the x, y and z components are combined to generate the final SDS address. In these final stages the AddressBase offset is applied.
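
Bringing these stages together, the per-address data path might be modelled as follows; this is a hedged, self-contained sketch with invented names, repeating the wrap test of the previous sketch inline:

    def sds_address(coords, scales, wrap_limits, wrap_bases,
                    hard_lo, hard_hi, address_base):
        # Combine three WrapBase-relative axis components into one SDS
        # address. Returns (address, routes); address is None when a hard
        # bound is exceeded, in which case NullData replaces read data.
        total, routes = address_base, []
        for c, scale, wl, wb, lo, hi in zip(coords, scales, wrap_limits,
                                            wrap_bases, hard_lo, hard_hi):
            if c > wl:                     # wrap: data belongs to the right
                local, route = c - wl, "right"
            elif c < 0:                    # wrap: data belongs to the left
                local, route = c + wl, "left"
            else:
                local, route = c, None     # strictly local access
            if not (lo <= local + wb <= hi):  # hard bounds use global address
                return None, routes
            total += local * scale         # Scale allows non-unity voxel sizes
            routes.append(route)
        return total, routes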

These steps are repeated for every data point in each row of the logical plane. When the end of a row is reached, the counting mechanism sends a different signal to the xRowIncr multiplexer to add xRowIncr to the address value calculated in that clock cycle for xColIncr.

Once this has been added, the rest of the procedure remains the same. Point by point, and row by row, this procedure is followed until all the points in the required logical plane, or patch of the logical plane, are calculated.

There is provision, by way of the multiplexer just before the AddressBase summation, for an alternative source of memory addresses to supply the memory address at this stage; this is represented by the term AddressMemory. However, this is only present in this figure to demonstrate the flexibility of the present design and to provide greater compatibility with existing network designs. As an example, addresses generated by the address generating logic of the SDMC 18 prior to multiplexing with the AddressMemory multiplexer can be substituted by addresses generated by an associative processor, or an array of associative processors, which instead generate the memory addresses and supply these to the local SDS. As an associative processor is a vector processor, the memory addresses generated would be dumped into a further register or cache (not shown) before being streamed into the multiplexer, as memory addresses can only be handled by the memory store linearly. It is possible for an array of associative processors to dump memory addresses into this register or cache before the addresses are streamed to the multiplexer.

This process is represented by the flow diagram shown in FIG. 9, which only shows the process for one of the co-ordinate axes, in this case the X axis. As will be shown in the description of this process, at a later point the memory addresses calculated in all three axes are summed to produce the final global memory address.

The co-ordinates of the first point on the logical plane patch, xInit, yInit and zInit, are already known and stored during initialisation. The process shown in FIG. 9 begins at step 50, which requires a decision on whether a new row of data points is to be calculated. If yes, the patch index logic adds xRowIncr to the value in the xRow register and the new value is stored in the xRow register. This is represented by the sequence of steps 52, 54 and 56. If the decision at 50 was that a new row on the logical plane is not being started, then the process jumps directly to step 58, wherein the xCol register value is modified by adding xColIncr, as shown by steps 58 and 60.

In practice, as the points on the logical plane are calculated row by row, it is the column increment vector, xColIncr, which will be added the most times. xRowIncr is only added to the xRow register value at the start of each new row on the logical plane patch.

For each data point, the xRow register value and the xCol register value are added together at step 60 to form the MemoryAddress (MA). The MA has its fractional part discarded at step 62.

Next, the remaining integer part of MA is compared to the WrapLimit (WL) at step 64. There are three outcomes to this comparison step: MA is less than 0, MA is greater than WL, or 0<MA<WL. The respective three outcomes are labelled 66, 68 and 70.

If 66 is true, then the router is set to route to the PDS local to the neighbouring SDS with a lower accessible memory address range than the present SDS, as shown at 72. Should 68 be true, then the router is set to route to the neighbouring SDS with a higher accessible memory address range than the present SDS, as shown at 74. It will be remembered from FIGS. 6a and 6b that when the router is set to route data to a neighbouring PDS, via the neighbouring router, the router at the same time sets up to be able to receive into the local PDS data returned from the other neighbouring router.

If the router is not activated at all, because the condition 0<MA<WL is true (step 70), then the router is set to pass-through mode at 76 and the MA is sent directly to the adding step at 88.

Referring back to 72 and 74: after 72, the MA (which is still a global MA) is added to the WL to produce a modified local MA at 78, and after 74 the WL is subtracted from the global MA to produce a modified local MA at 80.

The MA following from 78 or 80 will then pass through a HardBounds check step 82, and if MA lies outside the range HBoundL to HBoundH then the modified local MA is out of bounds (step 84). In all other cases 86 holds true and the MA passes to 88, wherein it is added to the MemoryAddresses generated by the corresponding Y-axis and Z-axis logic.

After 88, the AddressBase (AB) is added to the memory address at step 90, whereafter there is a check step 92 to see if the router is set to pass-through. If the router is set to pass-through, then the MA is sent to the local SDS to return data through the router at step 96, which in this case will return data to the local PDS.

Should the router not be set to pass-through, then a handshaking step is activated at 94, wherein each SDMC will wait until the neighbouring SDMCs are ready and synchronised for inter-SDMC data transfer through the respective routers. This handshaking step is described in more detail below.

Once the handshaking step has been completed, the MA is sent to the local SDS to return data through the router at step 96, and in this case the data will be returned to a neighbouring PDS.

After data has been received by the PDS at 98, the process will restart at step 50 unless either the end of the patch has been reached or the transfer length has been exceeded, in which case the SDMC will enter a wait state.

The handshaking step mentioned above is meant to ensure that all SDMCs transfer data between them in synchronicity. As stated before, in a SIMD scenario only this method will work. Each SDMC, when it has realised that a transfer of data from its local SDS to a neighbouring PDS is required, sends a notification signal to the neighbouring SDMC to which it intends to route data. At the appropriate time the neighbouring SDMC will detect the notification signal and return an acknowledgement signal. At the same time, the other neighbouring SDMC will have sent the first SDMC a notification signal and will be expecting to receive an acknowledgement signal. Thus, as soon as the first SDMC has received an acknowledgement and has sent an acknowledgement signal, it knows that it is ready to initiate the inter-SDMC data transfer at the next clock cycle.
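
The exchange might be modelled as follows; this is an invented, much simplified software analogue of the hardware signals, not the actual implementation:

    class SdmcHandshake:
        # One SDMC's view of the notify/acknowledge exchange.
        def __init__(self):
            self.notify_in = False   # notification pending from a neighbour
            self.ack_in = False      # our own notification was acknowledged
            self.ack_sent = False    # we acknowledged our neighbour

        def notify(self, routed_to):
            routed_to.notify_in = True   # tell the neighbour we route to

        def service(self, notifier):
            if self.notify_in:           # acknowledge a pending notification
                notifier.ack_in = True
                self.ack_sent = True

        def ready(self):
            # Transfer may start once an acknowledgement has been both
            # received and sent.
            return self.ack_in and self.ack_sent

In a ring of three such objects a, b and c, calling a.notify(b), b.notify(c) and c.notify(a), followed by a.service(c), b.service(a) and c.service(b), leaves all three ready().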

During that clock cycle all the SDMCs in the array will have settled down and be ready and waiting to initiate inter-SDMC data transfer. Of course, the amount of time taken to complete the handshaking process will increase with the number of SDMCs employed in the array; in the case where there are too many SDMCs to be able to complete handshaking in the space of one clock cycle, the controlling CPU can be notified of this and allow as many clock cycles as necessary to pass, in order to complete handshaking for all SDMCs, before allowing inter-SDMC data transfer.

For the SDMCs at the ends of arrays there is a link to enable handshaking between the two end SDMCs so that, in effect, with regard to the handshaking step there is a circle of SDMCs as opposed to a line with a starting and an ending SDMC.

In a further embodiment of the present invention, the hardware configuration is similar to the second embodiment, except that the SDMC 18 is connected to an associated parallel processor. The operation of the processing system of this third embodiment is represented by the flow diagram shown in FIG. 10. At initialisation by the CPU, the CPU sets both the SDMC 18 and the associated processor with initial instructions, as shown at steps 100 and 102 respectively. The SDMC 18 is set with a series of memory address transfers and also the parameters needed to carry out these memory address calculations (step 100). At step 102 the associated processor is set with a queue of processor functions, analogous to procedure or function calls in a conventional processor language scenario. When the initialisation stage has been completed, the CPU sends a start signal at step 104 and the SDMC 18 starts to process the queued memory addresses, outputs each memory address to the local SDS at step 106 and notifies the associated processor that the requested data is present in the PDS at 108. After the processor processes the data in the PDS, it returns the processed data to the PDS again (step 110) and notifies the SDMC at 112 that processed data is present in the PDS ready for the SDMC to move it somewhere else. In this case the SDMC moves the data back to the SDS at step 114.

After each batch of data has been processed, the processor handshakes with the SDMC, which queries whether it has remaining instructions to carry out at 116. If it does, the process loops back to 106 to continue the processing cycle. Once the SDMC runs out of further instructions to process, it sets a status flag at 118 to indicate that it is ready to accept the next batch of instructions. As is apparent, this setup has inherent scalability built in.
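A minimal sketch of this control loop follows, with the queue counter, notifications and status flag reduced to illustrative C; the names and the step comments keyed to FIG. 10 are assumptions of the example, not the actual interface.

    #include <stdbool.h>

    typedef struct {
        int  remaining;  /* queued memory-address transfers set at step 100 */
        bool done;       /* status flag set at step 118                     */
    } QueuedSdmc;

    void run_queue(QueuedSdmc *sdmc)
    {
        while (sdmc->remaining > 0) {    /* step 116: instructions left?    */
            /* step 106: generate the next address and fill the PDS         */
            /* step 108: notify the processor that its data is in the PDS   */
            /* steps 110/112: processor returns data and notifies the SDMC  */
            /* step 114: move the processed data back to the SDS            */
            sdmc->remaining--;
        }
        sdmc->done = true;  /* step 118: ready for the next batch of work   */
    }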

Other scenarios not described include those wherein the associated processor processes data and dumps data into the PDS for the SDMC to move to another data store, without ever needing the SDMC to process data and move it to the PDS in the first place. An alternative scenario is one whereby the SDMC is only required to process and move data to the PDS without being required to move data out of the PDS.

In the described scenario of an SDMC 18 closely associated with a processor, and in particular a parallel processor, further benefits of the invention can be realised. It is important for the parallel processor that the required data is ready for loading into the processor's registers at the required time, and by employing an SDMC 18 to move data into the PDS connected to the registers, there is enough data present in the PDS to enable the processor to operate at full speed, without having to wait for data to be made available in the PDS.

The parameters mentioned above are hardwired into the SDMC 18 to perform the algorithm described. In the present embodiments this SDMC 18 is particularly suited to manipulating 2-D and 3-D images. Of course, other algorithms more suited to alternative applications of the SDMC 18 mentioned elsewhere may utilise different hardwired parameters. It is evident from the foregoing description that this invention is particularly applicable to any large dataset which requires a set of vector instructions applied to it: the dataset is broken down into a series of smaller fragments, each fragment of data is allocated to an individual SDMC 18 (and the associated SDS), and each SDMC 18 is set with the initialising parameters.
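As a concrete illustration of the hardwired algorithm, the sketch below walks a selected plane point by point using the initialising parameters named earlier (initial point, column and row unit-vectors, and column/row limits). The x-then-y-then-z raster layout, the to_address() mapping and the emit() callback are assumptions of the example, and the accessibility check described elsewhere is omitted for brevity.

    #include <stddef.h>

    typedef struct { float x, y, z; } Vec3;

    /* Map a spatial coordinate to a linear address, assuming the volume
     * is stored in x-then-y-then-z raster order with non-negative
     * coordinates; real hardware would also range-check the address. */
    static size_t to_address(Vec3 p, size_t width, size_t height)
    {
        return ((size_t)p.z * height + (size_t)p.y) * width + (size_t)p.x;
    }

    /* Emit one (generally non-contiguous) address per contiguous point
     * on the selected plane: step along each row with the column
     * unit-vector, then advance to the next row with the row unit-vector. */
    void generate_plane(Vec3 init, Vec3 colIncr, Vec3 rowIncr,
                        int colLimitL, int colLimitH,
                        int rowLimitL, int rowLimitH,
                        size_t width, size_t height,
                        void (*emit)(size_t addr))
    {
        Vec3 rowStart = init;
        for (int r = rowLimitL; r <= rowLimitH; r++) {
            Vec3 p = rowStart;
            for (int c = colLimitL; c <= colLimitH; c++) {
                emit(to_address(p, width, height));
                p.x += colIncr.x; p.y += colIncr.y; p.z += colIncr.z;
            }
            rowStart.x += rowIncr.x; rowStart.y += rowIncr.y; rowStart.z += rowIncr.z;
        }
    }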

Having described particular preferred embodiments of the present invention, it is to be appreciated that the embodiments in question are exemplary only and that variations and modifications such as will occur to those possessed of the appropriate knowledge and skills may be made without departure from the spirit and scope of the invention as set forth in the appended claims. For example, the present invention is not restricted to SIMD technology and can be applied to any data-parallel processor technology.

1. A method performed within a computer system for generating a stream of non-contiguous memory ‘READ’ addresses representing contiguous points in a selected planar surface within a multiplanar image, the method comprising: generating, at a CPU of the computer system, initializing parameters mathematically defining an array of contiguous image data points within the selected planar surface of the multiplanar image, the initializing parameters including a plurality of parameterization unit vectors and spatial-coordinate boundary values; configuring a memory address engine with the initializing parameters; performing an algorithm within the memory address engine according to the initializing parameters to calculate the set of contiguous image data points in image space comprised within the selected planar surface; producing a plurality of non-contiguous memory addresses within the memory address engine using the calculated set of contiguous image data points for use in a subsequent memory ‘READ’ operation; and collating the non-contiguous memory addresses into the stream of memory ‘READ’ addresses for output to a data memory to perform the subsequent memory ‘READ’ operation.
2. The method of claim 1, wherein generating the initializing parameters enables viewing of the selected planar surface representing an image of a sectional view through the multiplanar image.
3. The method of claim 2, wherein generating the initializing parameters further enables viewing of a portion of the multiplanar image.
4. The method of claim 1, wherein performing the algorithm within the memory address engine includes: calculating memory addresses progressively for each image data point on a row in the selected planar surface; and repeating the above calculation for the number of rows in the selected planar surface.
5. The method of claim 4, wherein generating initializing parameters comprises: generating an initial spatial-coordinate value of an image data point on the selected planar surface; generating a plurality of unit vectors defining unit increases for image data points comprised within each of the columns and rows of the selected planar surface; and generating the spatial-coordinate boundary values defining the column and row limits for the required image data on the selected planar surface; and wherein configuring the memory address engine comprises: specifying the spatial-coordinates of the initial image data point on the selected planar surface; specifying a column unit-vector for defining a unit increase in each column of the selected planar surface; specifying a row unit-vector for defining a unit increase in each row of the selected planar surface; and specifying the column limits (ColLimitL, ColLimitH) and the row limits (RowLimitL, RowLimitH) for the required image data on the selected planar surface.
6. The method of claim 5, wherein specifying the spatial-coordinates of the initial image data point on the selected planar surface is carried out using Cartesian coordinates (xInit, yInit, zInit).
7. The method of claim 5, wherein specifying the column unit-vector and the row unit-vector is carried out using Cartesian coordinates (ColIncr(x,y,z), RowIncr(x,y,z)).
8. The method of claim 1, further comprising outputting the collated stream of memory addresses to a memory store.
9. The method of claim 1, wherein producing the plurality of non-contiguous memory addresses further comprises checking whether a generated memory address is accessible to the memory address engine for the ‘READ’ operation.
10. The method of claim 9, wherein producing the plurality of non-contiguous memory addresses further comprises comparing a generated memory address to that of a predetermined range of memory addresses accessible to the memory address engine for the ‘READ’ operation.
11. The method of claim 9, wherein producing the plurality of non-contiguous memory addresses further comprises returning a null result if the generated memory address is not accessible to the memory address engine for the ‘READ’ operation.
12. The method of claim 1, wherein a plurality of memory address engines are arranged to be operated in a Single Instruction Multiple Data (SIMD) configuration and configuring the memory address engine further comprises configuring at least one additional SIMD memory address engine with the generated initializing parameters, and wherein the algorithm is performed and the plurality of non-contiguous memory addresses are produced in each of the at least one additional SIMD memory address engines.
13. The method of claim 12, wherein producing the plurality of non-contiguous memory addresses incorporates determining whether a generated memory address is accessible to a neighboring SIMD memory address engine.
14. The method of claim 12, wherein producing the plurality of non-contiguous memory addresses further comprises calculating a memory address on behalf of a neighboring SIMD memory address engine.
15. The method of claim 14, further comprising routing data returned from a memory store associated with a respective SIMD memory address engine as though it had been returned from a respective memory store of a neighboring SIMD memory address engine.
16. The method of claim 15, wherein routing data further comprises excluding memory address data from being routed to the neighboring SIMD memory address engine.
17. The method of claim 15, further comprising synchronizing the data transfer of the SIMD memory address engine with that of its neighboring memory address engine.
18. A memory address engine adapted to accept initializing parameters set by an external controller, the memory address engine comprising: an address generator arranged to generate a plurality of non-contiguous memory addresses according to at least one algorithm implemented on the initializing parameters; and collation means arranged to collate the plurality of non-contiguous memory addresses into a stream of output memory ‘READ’ addresses for output to a data memory to perform a subsequent memory ‘READ’ operation; wherein the initializing parameters define an array of contiguous image data points in a selected planar surface of a multiplanar image by a plurality of parameterization unit vectors and spatial-coordinate boundary values.
19. The memory address engine of claim 18, wherein the algorithm is implemented in hardware as part of the address generator.
20. The memory address engine of claim 18, wherein the address generator comprises: means for calculating memory addresses progressively for each image data point on a row in the selected planar surface; and means for repeating the above calculation for the number of rows in the selected planar surface.
21. The memory address engine of claim 18, wherein the address generator is arranged to process received initialization parameters describing the spatial-coordinate value of an initial image data point on the selected planar surface, a column unit-vector defining a unit increase in each column of the selected planar surface, a row unit-vector defining a unit increase in each row of the selected planar surface, and the spatial-coordinate boundary values defining the column and row limits for the required image data.
22. The memory address engine of claim 21, wherein the address generator is arranged to process at least some initialization parameters specified in Cartesian coordinates.
23. The memory address engine of claim 18, further comprising access means for determining whether a generated memory address is accessible to the memory address engine for the ‘READ’ operation.
24. The memory address engine of claim 23, wherein the access means is arranged to compare a generated memory address to that of a predetermined range of memory addresses accessible to the memory address engine for the ‘READ’ operation.
25. The memory address engine of claim 23, wherein the access means is arranged to return a null result if the generated memory address is not accessible to the memory address engine for the ‘READ’ operation.
26. The memory address engine of claim 23, wherein the access means is arranged to determine whether a generated memory address is accessible to a neighboring memory address engine.
27. The memory address engine of claim 26, wherein the access means is arranged to calculate a memory address on behalf of a neighboring memory address engine.
28. The memory address engine of claim 18, wherein the address generator comprises an associative processor.
29. The memory address engine of claim 18, wherein the address generator comprises an array of associative processors.
30. The memory address engine of claim 28, wherein the associative processor(s) comprise one or more associative string processors.
31. A memory address engine adapted to receive initializing parameters set by an external controller, the initializing parameters defining an array of contiguous image data points in a selected planar surface within a multiplanar image, and the memory address engine comprising: an address generator comprising an array of associative processors arranged in a Single Instruction Multiple Data (SIMD) configuration, the address generator adapted to generate a plurality of non-contiguous memory addresses according to at least one algorithm implemented on received initializing parameters, and the at least one algorithm configured in hardware components; wherein the received initialization parameters mathematically define an initial data point in the selected planar surface of the multiplanar image, a column vector defining a unit increase in each column of the selected planar surface, a row vector defining a unit increase in each row of the selected planar surface, and column and row limits for the required data in the selected planar surface; and wherein the at least one algorithm operates to calculate the non-contiguous memory addresses progressively for each data point on a row in the selected planar surface and repeat for each row in the selected planar surface; collation means for collating non-contiguous memory addresses into a stream of output memory addresses for output to a data memory; and access means for determining whether a generated memory address is accessible to either the memory address engine or to a neighboring memory address engine; wherein the access means is adapted to calculate a memory address on behalf of a neighboring memory address engine.
32. The memory address engine of claim 31, wherein the address generator is arranged to process one or more initialization parameters specified in Cartesian coordinates.